

Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions

Neural Information Processing Systems

We study the dynamics of optimization and the generalization properties of one-hidden-layer neural networks with quadratic activation function in the overparametrized regime, where the layer width m is larger than the input dimension d. We consider a teacher-student scenario where the teacher has the same structure as the student, with a hidden layer of smaller width m* <= m. We describe how the empirical loss landscape is affected by the number n of data samples and the width m* of the teacher network. In particular, we determine how the probability that the empirical loss has no spurious minima depends on n, d, and m*, thereby establishing conditions under which the neural network can in principle recover the teacher. We also show that under the same conditions gradient descent dynamics on the empirical loss converges and leads to small generalization error, i.e. it enables recovery in practice. Finally, we characterize the time-convergence rate of gradient descent in the limit of a large number of samples. These results are confirmed by numerical experiments.
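As a rough illustration of the setup described in this abstract (not the paper's actual experiments), the teacher-student recovery with quadratic activations can be sketched in a few lines of NumPy; the dimensions, learning rate, and initialization scale below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, m_star, n = 10, 20, 3, 500   # input dim, student width, teacher width, samples

# Teacher: y(x) = sum_{j <= m*} (v_j . x)^2, a planted low-rank quadratic.
V = rng.standard_normal((m_star, d))
X = rng.standard_normal((n, d))          # Gaussian inputs
y = np.sum((X @ V.T) ** 2, axis=1)

# Overparametrized student (m > d): f(x) = sum_{i <= m} (w_i . x)^2.
W = 0.1 * rng.standard_normal((m, d))

def emp_loss(W):
    return 0.5 * np.mean((np.sum((X @ W.T) ** 2, axis=1) - y) ** 2)

init_loss = emp_loss(W)
lr = 1e-4
for _ in range(5000):
    resid = np.sum((X @ W.T) ** 2, axis=1) - y       # residuals, shape (n,)
    # gradient w.r.t. row w_i: (2/n) sum_k resid_k (w_i . x_k) x_k
    W -= lr * (2.0 / n) * ((W @ X.T) * resid) @ X
final_loss = emp_loss(W)
print(init_loss, final_loss)
```

With n large relative to d(d+1)/2 and m >= m*, full-batch gradient descent drives the empirical loss toward zero, consistent with the recovery regime the abstract describes.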


Review for NeurIPS paper: Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions

Neural Information Processing Systems

Reviews for this paper are mixed; in particular, some reviewers were concerned about missing proofs. On the other hand, the paper studies an important problem and carries out a nice analysis that integrates numerical experiments, heuristic derivations, and rigorous proofs in a meaningful way, and the reader learns a lot about such models (quadratic 2-layer networks with sparse teacher). It is thus necessary that the authors spend a lot of effort writing the missing proofs thoroughly, because it will not be possible to review those proofs again (and of course all the other changes proposed in the rebuttal should be implemented). Overall, for such a paper that contains true statements, conjectures, and heuristics, it is very important to emphasize the "truth status" of each statement, and "true statements" should have a proof.


Review for NeurIPS paper: Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions

Neural Information Processing Systems

For random initialization, I also believe the analysis still needs substantial effort. The upper bound on E(A(t)) clearly depends on the condition number of A(0), rather than simply splitting into full-rank and rank-deficient cases. Moreover, rather than focusing only on the full-rank case, the authors could treat the problem uniformly and continuously; e.g., the Marchenko-Pastur (MP) law from random matrix theory may help provide an asymptotic analysis for random initialization, since the limiting distribution of the eigenvalues is known. A non-asymptotic version may also exist, but more perturbation bounds would be needed. By the way, due to my research background, I had overlooked the literature on shallow neural networks with random Gaussian input. I apologize for that and raise my score.
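To make the reviewer's MP-law suggestion concrete, here is a minimal sketch, assuming (as an illustration, not something stated in the review) that the random initialization is A(0) = W(0)^T W(0)/m for Gaussian W(0); its spectrum then concentrates on the Marchenko-Pastur support, whose edges govern the condition number of A(0):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 200, 400                  # input dim, layer width; aspect ratio gamma = d/m
gamma = d / m

# For Gaussian W, A(0) = W^T W / m is a Wishart matrix; as d, m -> inf with
# d/m = gamma < 1 fixed, its eigenvalues follow the Marchenko-Pastur law
# supported on [(1 - sqrt(gamma))^2, (1 + sqrt(gamma))^2].
W = rng.standard_normal((m, d))
eigs = np.linalg.eigvalsh(W.T @ W / m)

lam_minus = (1 - np.sqrt(gamma)) ** 2
lam_plus = (1 + np.sqrt(gamma)) ** 2
cond = eigs.max() / eigs.min()   # condition number of the random A(0)
print(eigs.min(), eigs.max(), lam_minus, lam_plus, cond)
```

For gamma < 1 the lower edge is bounded away from zero, so a random full-rank initialization has a condition number that is predictable asymptotically, which is exactly what would feed into a refined bound on E(A(t)).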




From Optimization Dynamics to Generalization Bounds via Łojasiewicz Gradient Inequality

Liu, Fusheng, Yang, Haizhao, Hayou, Soufiane, Li, Qianxiao

arXiv.org Artificial Intelligence

Optimization and generalization are two essential aspects of statistical machine learning. In this paper, we propose a framework to connect optimization with generalization by analyzing the generalization error based on the optimization trajectory under the gradient flow algorithm. The key ingredient of this framework is the Uniform-LGI, a property that is generally satisfied when training machine learning models. Leveraging the Uniform-LGI, we first derive convergence rates for the gradient flow algorithm, and then give generalization bounds for a large class of machine learning models. We further apply our framework to three distinct machine learning models: linear regression, kernel regression, and two-layer neural networks. Through our approach, we obtain generalization estimates that match or extend previous results.
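A toy illustration of the mechanism behind such results (not taken from the paper): the one-dimensional loss L(x) = x^4 satisfies a Łojasiewicz gradient inequality |L'(x)| = 4 L(x)^{3/4}, i.e. exponent 3/4 rather than the Polyak-Łojasiewicz value 1/2, so gradient flow converges at a polynomial rather than exponential rate. For x(0) = 1 the flow obeys d(x^{-2})/dt = 8, giving the exact rate L(x(t)) = (1 + 8t)^{-2}, which a forward-Euler discretization reproduces:

```python
# Gradient flow dx/dt = -L'(x) on L(x) = x^4, discretized by forward Euler.
# Since d(x^{-2})/dt = -2 x^{-3} * (-4 x^3) = 8, the exact flow gives
# L(x(t)) = (1 + 8 t)^{-2} for x(0) = 1: a polynomial convergence rate,
# matching the Lojasiewicz exponent 3/4 > 1/2.
x, dt, T = 1.0, 1e-4, 10.0
for _ in range(int(T / dt)):
    x -= dt * 4 * x ** 3
L_T = x ** 4
exact = (1 + 8 * T) ** -2
print(L_T, exact)
```

Exponents larger than 1/2 in the gradient inequality translate precisely into these slower polynomial rates, which is why the convergence analysis in the framework is organized around the LGI exponent.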


Interplay Between Optimization and Generalization of Stochastic Gradient Descent with Covariance Noise

Wen, Yeming, Luk, Kevin, Gazeau, Maxime, Zhang, Guodong, Chan, Harris, Ba, Jimmy

arXiv.org Machine Learning

The choice of batch size in a stochastic optimization algorithm plays a substantial role in both optimization and generalization. Increasing the batch size typically improves optimization but degrades generalization. To address the problem of improving generalization while maintaining optimal convergence in large-batch training, we propose to add covariance noise to the gradients. We demonstrate that the optimization performance of our method is more accurately captured by the structure of the noise covariance matrix than by the variance of the gradients. Moreover, for the convex quadratic case, we prove that it can be characterized by the Frobenius norm of the noise matrix. Our empirical studies with standard deep learning model architectures and datasets show that our method not only improves generalization performance in large-batch training, but does so in a way where the optimization performance remains desirable and the training duration is not prolonged.
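A hedged sketch of the idea (the loss, dimensions, and schedule below are illustrative assumptions, not the authors' setup): on a linear regression problem, run full-batch gradient descent but inject Gaussian noise whose covariance matches the per-sample gradient covariance scaled down by the batch size, mimicking the noise structure of small-batch SGD without sacrificing the large-batch gradient estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1024, 5
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true + 0.1 * rng.standard_normal(n)

def per_sample_grads(theta):
    # gradients of 0.5 * (x . theta - y)^2 for every sample
    r = X @ theta - y
    return r[:, None] * X

theta = np.zeros(d)
B, lr = n, 0.05                  # "large batch" = the whole dataset here
for _ in range(200):
    g = per_sample_grads(theta)
    mean_g = g.mean(axis=0)
    C = np.cov(g.T) / B          # covariance gradient noise would have at batch size B
    noise = rng.multivariate_normal(np.zeros(d), C)
    theta -= lr * (mean_g + noise)
print(np.linalg.norm(theta - theta_true))
```

The injected noise inherits the anisotropic structure of the per-sample gradients, which is the quantity the abstract argues governs the method's behavior, rather than the raw gradient variance.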


Optimization in Machine Learning: Robust or global minimum?

@machinelearnbot

We understand that in convex problems it is much easier to find the global optimum. We appreciate the opportunity to participate in this discussion. KD, MF: No, the convexified problem can have a minimum that is quite different from the original problem. The motivation for our paper comes from the fact that in many problems (like control and reinforcement learning) one is interested in a "robust" minimum (a minimum such that the cost does not increase much when you perturb the parameters). Our method destroys non-robust minima and preserves a single robust minimum of the problem.